
Add YOLO26 object detection contrib model#151

Open
jimburtoft wants to merge 3 commits into aws-neuron:main from jimburtoft:contrib/yolo26

Conversation

@jimburtoft
Contributor

Summary

  • Adds Ultralytics YOLO26 object detection models (n/s/m/l/x, 2.4-58.9M params) as a contrib model for real-time inference on AWS Trainium2 and Inferentia2 via torch_neuronx.trace()
  • All 5 detection variants compile and run with high accuracy (CosSim 0.987-0.997), plus pose and OBB task heads
  • Neuron outperforms compiled A10G GPU by 1.4-4.5x on s/m/l/x variants at peak DP throughput
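The compile path described above goes through torch_neuronx.trace(). A minimal sketch of what that typically looks like — illustrative only, not the PR's actual src/modeling_yolo26.py; it assumes a Trn2/Inf2 instance with the torch-neuronx and ultralytics packages installed, and that "yolo26n.pt" resolves to the nano checkpoint:

```python
# Illustrative sketch, not the PR's implementation (see src/modeling_yolo26.py).
# Requires Neuron hardware plus the torch-neuronx and ultralytics packages.
import torch
import torch_neuronx
from ultralytics import YOLO

model = YOLO("yolo26n.pt").model.eval()   # underlying detection nn.Module
example = torch.zeros(1, 3, 640, 640)     # static input shape required for tracing
neuron_model = torch_neuronx.trace(
    model,
    example,
    compiler_args=["--lnc", "1"],         # LNC=1 mode, per the design notes below
)
torch.jit.save(neuron_model, "yolo26n_neuron.pt")
```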

Validation

Validated on 4 configurations: trn2.3xlarge × {SDK 2.28, 2.29} and inf2.xlarge × {SDK 2.28, 2.29}.

Instance       SDK    Tests            yolo26n CosSim   yolo26s CosSim   yolo26n img/s   yolo26s img/s
trn2.3xlarge   2.28   13/13 pytest     0.9943           0.9931           32.3            66.0
trn2.3xlarge   2.29   13/13 pytest     0.9941           0.9931           33.2            65.5
inf2.xlarge    2.28   6/6 standalone   0.9965           0.9931           60.1            64.1
inf2.xlarge    2.29   6/6 standalone   0.9965           0.9931           69.7            76.7
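The CosSim figures compare the Neuron output tensor against the CPU reference, flattened to vectors. A minimal sketch of the metric (helper name is illustrative, not the PR's code):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flattened vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Values near 1.0 (e.g. the 0.993-0.997 range above) indicate the compiled model closely matches the reference.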

Peak Throughput (trn2.3xlarge, LNC=1, DP=8)

Variant   Params   Dtype   img/s   vs A10G Compiled
YOLO26n   2.4M     FP32      272   0.13x
YOLO26s   10.0M    FP32    1,523   1.43x
YOLO26m   21.9M    BF16    1,267   2.67x
YOLO26l   26.3M    BF16    1,093   2.95x
YOLO26x   58.9M    BF16      876   4.49x
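For context, the DP=8 aggregate figures imply approximate per-NeuronCore rates and A10G baselines. This is simple arithmetic on the table above, not separately measured data:

```python
# Back-of-envelope arithmetic on the DP=8 aggregate figures above.
# Derived numbers are implied, not separately measured.
DP = 8
rows = {  # variant: (aggregate img/s, speedup vs compiled A10G)
    "YOLO26s": (1523, 1.43),
    "YOLO26m": (1267, 2.67),
    "YOLO26l": (1093, 2.95),
    "YOLO26x": (876, 4.49),
}
for name, (imgs, speedup) in rows.items():
    per_core = imgs / DP    # implied per-NeuronCore rate
    a10g = imgs / speedup   # implied compiled-A10G baseline
    print(f"{name}: ~{per_core:.0f} img/s per core, A10G ~{a10g:.0f} img/s")
```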

Files

contrib/models/YOLO26/
  README.md                          # Model card, benchmarks, compatibility matrix
  yolo26_neuron_notebook.ipynb       # Complete workflow notebook (tested end-to-end)
  src/
    __init__.py                      # Exports: YOLO26NeuronModel, compile_yolo26, etc.
    modeling_yolo26.py               # Trace wrapper, DP support, validation (~280 lines)
  test/
    __init__.py
    unit/__init__.py
    integration/
      __init__.py
      test_model.py                  # 13 integration tests (compile, accuracy, DP, perf)

Key Design Decisions

  • torch_neuronx.trace() (not NxDI model classes): YOLO26 is a CNN with no KV cache, no attention matrices, no token generation. All variants fit on a single NeuronCore (<180 MB NEFF). Data Parallelism provides throughput scaling.
  • end2end=False: topk/sort operations are not supported on Neuron (NCC_EVRF029). Raw [B, 84, 8400] output with CPU postprocessing (~0.1ms overhead).
  • BF16 for m/l/x: FP32 exceeds SB allocation for larger variants (NCC_IGCA030). n/s use FP32.
  • No --auto-cast flags: matmul autocast produces NaN outputs for Conv2d-dominant models.
  • LNC-aware compilation: the --lnc 1 compiler flag is required when running in LNC=1 mode.
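Because end2end=False moves topk/sort off-device, the host decodes the raw [B, 84, 8400] tensor (4 box coordinates plus 80 class scores per anchor). A hypothetical sketch of that CPU postprocessing step, using plain Python lists in place of tensors — function and threshold names are illustrative, not the PR's code:

```python
# Hypothetical sketch of the CPU-side postprocessing for the raw
# [84, N] detection output (4 box coords + 80 class scores per anchor).
# topk/sort are unsupported on Neuron (NCC_EVRF029), so filtering and
# ranking run on the host.
def decode_detections(raw, conf_thres=0.25):
    """raw: 84 rows, each a list of N per-anchor values.
    Returns (box, class_id, score) for anchors above conf_thres."""
    boxes, scores = raw[:4], raw[4:]
    n = len(boxes[0])
    detections = []
    for i in range(n):
        # best class score for this anchor
        cls_id, score = max(enumerate(col[i] for col in scores),
                            key=lambda t: t[1])
        if score >= conf_thres:
            box = [boxes[k][i] for k in range(4)]
            detections.append((box, cls_id, score))
    # rank by confidence -- the sort Neuron cannot do on-device
    detections.sort(key=lambda d: -d[2])
    return detections
```

On real outputs this loop vectorizes trivially with torch/numpy, which is consistent with the ~0.1 ms overhead cited above.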

Target

aws-neuron/neuronx-distributed-inference main branch.

Ultralytics YOLO26 (n/s/m/l/x) on Trainium2 via torch_neuronx.trace().
All 5 detection variants plus pose and OBB task heads compile and run
with high accuracy (CosSim 0.987-0.997).

Peak throughput on trn2.3xlarge (LNC=1, DP=8):
- YOLO26s: 1,523 img/s (1.43x vs A10G compiled)
- YOLO26m: 1,267 img/s (2.67x vs A10G compiled)
- YOLO26l: 1,093 img/s (2.95x vs A10G compiled)
- YOLO26x:   876 img/s (4.49x vs A10G compiled)

Includes modeling module, 13 integration tests (all passing),
Jupyter notebook, and README with benchmarks.
Tested all 4 combinations:
- trn2.3xlarge SDK 2.28: 13/13 pytest passed
- trn2.3xlarge SDK 2.29: 13/13 pytest passed
- inf2.xlarge SDK 2.28: 6/6 standalone tests passed
- inf2.xlarge SDK 2.29: 6/6 standalone tests passed

inf2 single-core throughput: yolo26n 60-70 img/s, yolo26s 64-77 img/s.
Updated compatibility matrix and notebook prerequisites.
